Search results: All records where Creators/Authors contains "Suresh, Lalith"

  1. Modern cluster managers like Borg, Omega, and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers: we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller's view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state's evolution with and without perturbations to detect safety and liveness issues. Sieve's design is powered by a fundamental opportunity in state-reconciliation systems: these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers, with a low false-positive rate of 3.5%.
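    To make the perturb-and-compare idea concrete, here is a minimal Go sketch; every identifier in it is hypothetical, and it is not Sieve's actual interface. A toy controller is reconciled from the same initial state twice, once against its true view and once against a stale view it is expected to tolerate, and the two final cluster states are diffed to flag a potential issue.

        // Minimal sketch of perturb-and-compare reliability testing.
        // All names here are hypothetical, not Sieve's real API.
        package main

        import (
        	"fmt"
        	"reflect"
        )

        // ClusterState maps object names to resource versions.
        type ClusterState map[string]int

        // Controller is one reconciliation step: it drives the cluster
        // toward a desired state based on the view it observes.
        type Controller func(view ClusterState) ClusterState

        // staleView is one example perturbation: the controller observes
        // an old snapshot instead of the latest state, a delay that a
        // correct controller is expected to tolerate.
        func staleView(old ClusterState) func(ClusterState) ClusterState {
        	return func(ClusterState) ClusterState { return old }
        }

        // run drives the controller for a few reconcile rounds, optionally
        // perturbing the view it observes on each round.
        func run(c Controller, initial ClusterState, perturb func(ClusterState) ClusterState) ClusterState {
        	state := initial
        	for i := 0; i < 3; i++ {
        		view := state
        		if perturb != nil {
        			view = perturb(state)
        		}
        		state = c(view)
        	}
        	return state
        }

        func main() {
        	// A toy controller that "creates" object b once it sees
        	// object a, but naively rebuilds its output from the view.
        	controller := func(view ClusterState) ClusterState {
        		next := ClusterState{}
        		for k, v := range view {
        			next[k] = v
        		}
        		if _, ok := view["a"]; ok {
        			next["b"] = 1
        		}
        		return next
        	}

        	initial := ClusterState{"a": 1}
        	reference := run(controller, initial, nil)
        	perturbed := run(controller, initial, staleView(ClusterState{}))

        	// Divergence between the two final states indicates the
        	// controller does not tolerate the injected perturbation.
        	if !reflect.DeepEqual(reference, perturbed) {
        		fmt.Printf("potential bug: reference=%v perturbed=%v\n", reference, perturbed)
        	}
        }

    The diff step mirrors the abstract's description of comparing the cluster state's evolution with and without perturbations: the stale-view run loses objects the reference run retains, which is exactly the kind of safety divergence such a comparison surfaces.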
  2. Modern datacenter infrastructures are increasingly architected as a cluster of loosely coupled services. The cluster state is typically maintained in a logically centralized, strongly consistent data store (e.g., ZooKeeper, Chubby, and etcd), while the services learn about the evolving state by reading from the data store or via a stream of notifications. However, it is challenging to ensure that services are correct even in the presence of failures, networking issues, and the inherent asynchrony of the distributed system. In this paper, we identify that partial histories can be used to effectively reason about correctness for individual services in such distributed infrastructure systems. That is, individual services make decisions based on observing only a subset of changes to the world around them. We show that partial histories, when applied to distributed infrastructures, have immense explanatory power and utility over the state of the art. We discuss the implications of partial histories and sketch tooling for reasoning about distributed infrastructure systems.
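    To illustrate what a partial history looks like, here is a minimal Go sketch (hypothetical names, not the paper's tooling): a notification stream coalesces updates so that a service observes only the latest change per key, one legal subset of the full history committed to the store, and any correctness argument about the service must hold over that subset.

        // Minimal sketch of a service acting on a partial history of
        // changes to a shared data store. Hypothetical names throughout.
        package main

        import "fmt"

        // Change records one committed update to a key in the store.
        type Change struct {
        	Key   string
        	Value int
        }

        // observe simulates a notification stream that coalesces updates:
        // the service sees only the latest change per key, a legal
        // partial history of the full one.
        func observe(history []Change) []Change {
        	latest := map[string]Change{}
        	var order []string
        	for _, c := range history {
        		if _, seen := latest[c.Key]; !seen {
        			order = append(order, c.Key)
        		}
        		latest[c.Key] = c
        	}
        	partial := make([]Change, 0, len(order))
        	for _, k := range order {
        		partial = append(partial, latest[k])
        	}
        	return partial
        }

        func main() {
        	// The total order of changes as committed to the strongly
        	// consistent store (e.g., what etcd would record).
        	full := []Change{
        		{Key: "replicas", Value: 1},
        		{Key: "replicas", Value: 3},
        		{Key: "replicas", Value: 2},
        	}
        	// The service never observes replicas=3, so its correctness
        	// must be argued over this partial history, not the full one.
        	for _, c := range observe(full) {
        		fmt.Printf("service observed %s=%d\n", c.Key, c.Value)
        	}
        }

    In practice, notification mechanisms can deliver exactly such subsets, for example after a reconnect or a store-side compaction, which is why reasoning over partial histories rather than the full history matches what individual services actually see.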